A Proofs
Section A.1 presents the lemmas used to prove the main results; Section A.2 presents the main results. The first two inequalities follow from the triangle inequality, and the third from the definition of the L-divergence in Eq. (5). The proof of Theorem 4.1 is completed by applying Lemma A.1, with Lemma A.3 and Lemma A.4 used to bound the Rademacher complexity. Following the proof of Theorem 4.1, the result of Proposition 4.3 is obtained, and based on it, a concentration bound holding for any δ ∈ (0, 1) follows; that proof is completed by applying the triangle inequality. III: Samples from p and q are labeled with 0 and 1, respectively. All values are averaged over five trials.
- North America > United States > New York > Monroe County > Rochester (0.04)
- North America > Canada > British Columbia > Metro Vancouver Regional District > Vancouver (0.04)
- Asia > China > Beijing > Beijing (0.04)
- Asia > China > Anhui Province > Hefei (0.04)
Supplementary Materials for "DropCov: A Simple yet Effective Method for Improving Deep Architectures" Qilong Wang
Our proposed DropCov can be flexibly integrated with existing deep architectures (e.g., CNNs). Qinghua Hu is the corresponding author and is with the Engineering Research Center of City Intelligence and Digital Governance, Ministry of Education of the People's Republic of China. Experiments with VGG-VD on three small-scale fine-grained datasets show that 0.5 is the best choice. As listed in Table S2, a single LT module brings only a small gain for plain GCP, while B-CNN + LT (79.62% training accuracy) achieves a significant improvement over both B-CNN and plain GCP. In contrast, samples involving less redundant information (e.g., scenes) have larger values; these phenomena are consistent with our finding. Is second-order information helpful for large-scale visual recognition?
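The passage above concerns global covariance pooling (GCP) combined with a dropout-style regularizer. As a rough, hypothetical sketch (the paper's exact dropout placement and normalization are not given here), channel dropout applied before covariance pooling might look like:

```python
import numpy as np

def gcp_with_dropout(x, drop_rate=0.5, rng=None):
    """Global covariance pooling (GCP) with channel dropout -- a toy sketch.

    x: CNN feature map of shape (C, H, W).
    Returns the upper triangle of the C x C covariance matrix as a vector.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    c, h, w = x.shape
    feats = x.reshape(c, h * w)                            # C channels x N spatial positions
    if drop_rate > 0:
        mask = rng.random(c) >= drop_rate                  # randomly drop whole channels
        feats = feats * mask[:, None] / (1.0 - drop_rate)  # inverted-dropout scaling
    feats = feats - feats.mean(axis=1, keepdims=True)
    cov = feats @ feats.T / (h * w - 1)                    # C x C second-order statistics
    iu = np.triu_indices(c)
    return cov[iu]                                         # compact covariance descriptor

x = np.random.default_rng(1).normal(size=(8, 7, 7))
desc = gcp_with_dropout(x)
print(desc.shape)
```

The descriptor has C(C+1)/2 entries (36 for C=8), which is what makes second-order pooling both expressive and prone to overfitting without regularization.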
- Asia > China > Tianjin Province > Tianjin (0.05)
- Asia > China > Liaoning Province > Dalian (0.04)
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > California > Santa Clara County > Stanford (0.04)
- North America > United States > California > Los Angeles County > Long Beach (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.96)
- Information Technology > Artificial Intelligence > Natural Language > Machine Translation (0.68)
- Asia > Taiwan (0.05)
- Europe > Portugal > Aveiro > Aveiro (0.04)
- Europe > Greece > Central Macedonia > Thessaloniki (0.04)
- (7 more...)
Algorithm 1: Pseudocode of PIC in a PyTorch-like style
Linear Evaluation Protocol. In linear evaluation, we follow the common setting [6, 5] to freeze the backbone of ResNet-50 and train a supervised linear classifier on the global average pooling features for 100 epochs. Note that the 2-layer head used in unsupervised pre-training is not used in the linear evaluation stage. During training, we augment the image with random scaling from 0.5 to 2.0, a crop size of 769, and random flipping. The top-1 and top-5 accuracy results are reported in Table 9. From the perspective of optimization goals, the only difference between the parametric instance classification framework and the supervised classification framework is how the classes are defined for each instance.
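The protocol above freezes the backbone and trains only a linear classifier on top. A minimal sketch, using a fixed random ReLU projection as a stand-in for the frozen ResNet-50 backbone (everything here is a toy assumption, not the paper's setup):

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen backbone: a fixed random ReLU projection
# whose weights are never updated during linear evaluation.
W_backbone = rng.normal(size=(16, 16))
def frozen_features(x):
    return np.maximum(x @ W_backbone, 0.0)

# Toy binary task: the label depends on the first input coordinate.
X = rng.normal(size=(400, 16))
y = (X[:, 0] > 0).astype(float)

# Linear evaluation: only the linear classifier on top is trained.
F = frozen_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(1000):                      # full-batch logistic regression
    p = 1.0 / (1.0 + np.exp(-(F @ w + b)))
    g = p - y
    w -= 0.1 * (F.T @ g) / len(y)
    b -= 0.1 * g.mean()

acc = (((F @ w + b) > 0).astype(float) == y).mean()
print(f"linear-probe training accuracy: {acc:.2f}")
```

The key design point is that only `w` and `b` receive gradient updates; the quality of the frozen features alone determines how well the linear probe can do.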
Addressing divergent representations from causal interventions on neural networks
Grant, Satchel, Han, Simon Jerome, Tartaglini, Alexa R., Potts, Christopher
A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025), allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.
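A toy sketch of the kind of intervention discussed: patch part of a hidden state from a "source" run into a "base" run, then check how far the patched state lies from the layer's natural activation distribution. Mahalanobis distance is used here as one simple divergence proxy; the paper's models, interventions, and measures may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny 2-layer network standing in for the target model (toy weights).
W1, W2 = rng.normal(size=(4, 8)), rng.normal(size=(8, 3))
def layer1(x): return np.tanh(x @ W1)
def layer2(h): return h @ W2

# Estimate the natural distribution of the hidden layer from sample inputs.
H = layer1(rng.normal(size=(1000, 4)))
mu = H.mean(axis=0)
cov = np.cov(H.T) + 1e-6 * np.eye(8)       # small ridge for invertibility
cov_inv = np.linalg.inv(cov)

def mahalanobis_sq(h):
    d = h - mu
    return float(d @ cov_inv @ d)

# Interchange intervention: patch part of the source run's hidden state
# into the base run's forward pass.
base, source = rng.normal(size=4), rng.normal(size=4)
h_patched = layer1(base).copy()
h_patched[:4] = layer1(source)[:4]         # patch a subspace of the hidden state
out = layer2(h_patched)

# Divergence check: distance of the patched state from the natural
# hidden distribution, compared with typical natural states.
d2_patched = mahalanobis_sq(h_patched)
d2_natural = np.median([mahalanobis_sq(h) for h in H[:200]])
print(f"patched d^2={d2_patched:.1f}  natural median d^2={d2_natural:.1f}")
```

Each patched coordinate is individually in-distribution, so a large joint distance here would indicate exactly the off-manifold mixing the abstract warns about.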
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.04)
- North America > United States > California > Santa Clara County > Palo Alto (0.04)
- North America > Mexico > Mexico City > Mexico City (0.04)
- Asia > Thailand > Bangkok > Bangkok (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (0.93)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.93)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (0.93)
FedAPA: Federated Learning with Adaptive Prototype Aggregation Toward Heterogeneous Wi-Fi CSI-based Crowd Counting
Guo, Jingtao, Mao, Yuyi, Ho, Ivan Wang-Hei
Wi-Fi channel state information (CSI)-based sensing provides a non-invasive, device-free approach for tasks such as human activity recognition and crowd counting, but large-scale deployment is hindered by the need for extensive site-specific training data. Federated learning (FL) offers a way to avoid raw data sharing but is challenged by heterogeneous sensing data and device resources. This paper proposes FedAPA, a collaborative Wi-Fi CSI-based sensing algorithm that uses an adaptive prototype aggregation (APA) strategy to assign similarity-based weights to peer prototypes, enabling adaptive client contributions and yielding a personalized global prototype for each client instead of a fixed-weight aggregation. During local training, we adopt a hybrid objective that combines classification learning with representation contrastive learning to align local and global knowledge. We provide a convergence analysis of FedAPA and evaluate it in a real-world distributed Wi-Fi crowd counting scenario with six environments and up to 20 people. The results show that our method outperforms multiple baselines in terms of accuracy, F1 score, mean absolute error (MAE), and communication overhead, with FedAPA achieving at least a 9.65% increase in accuracy, a 9% gain in F1 score, a 0.29 reduction in MAE, and a 95.94% reduction in communication overhead.
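The similarity-based weighting described above can be sketched as follows. This is one plausible instantiation (softmax over cosine similarities); the abstract does not specify FedAPA's exact weighting rule, so treat the details as assumptions.

```python
import numpy as np

def adaptive_prototype_aggregation(local_proto, peer_protos, temperature=1.0):
    """Similarity-weighted aggregation of peer prototypes (illustrative).

    local_proto: (D,) this client's prototype for one class.
    peer_protos: (K, D) prototypes for the same class from K peers.
    Returns a personalized global prototype for this client.
    """
    a = local_proto / np.linalg.norm(local_proto)
    B = peer_protos / np.linalg.norm(peer_protos, axis=1, keepdims=True)
    sims = B @ a                             # cosine similarity per peer
    w = np.exp(sims / temperature)           # softmax -> adaptive weights
    w /= w.sum()
    return w @ peer_protos                   # client-specific aggregate

rng = np.random.default_rng(0)
local = rng.normal(size=8)
peers = rng.normal(size=(5, 8))
global_proto = adaptive_prototype_aggregation(local, peers)
print(global_proto.shape)
```

Because the weights depend on the querying client's own prototype, each client receives a different aggregate, which is what distinguishes this from fixed-weight (e.g., uniform) averaging.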
SAS: Simulated Attention Score
Zheng, Chuanyang, Sun, Jiankai, Gao, Yihang, Wang, Yuehao, Wang, Peihao, Xiong, Jing, Ren, Liliang, Cheng, Hao, Kulkarni, Janardhan, Shen, Yelong, Wang, Atlas, Schwager, Mac, Schneider, Anderson, Liu, Xiaodong, Gao, Jianfeng
The attention mechanism is a core component of the Transformer architecture. Various methods have been developed to compute attention scores, including multi-head attention (MHA), multi-query attention, group-query attention, and so on. We further analyze MHA and observe that its performance improves as the number of attention heads increases, provided the hidden size per head remains sufficiently large. Therefore, increasing both the head count and the hidden size per head with minimal parameter overhead can lead to significant performance gains at low cost. Motivated by this insight, we introduce Simulated Attention Score (SAS), which maintains a compact model size while simulating a larger number of attention heads and a larger hidden feature dimension per head. This is achieved by projecting a low-dimensional head representation into a higher-dimensional space, effectively increasing attention capacity without increasing the parameter count. Beyond the head representations, we further extend the simulation approach to the feature dimension of the key and query embeddings, enhancing expressiveness by mimicking the behavior of a larger model while preserving the original model size. To control the parameter cost, we also propose Parameter-Efficient Attention Aggregation (PEAA). Comprehensive experiments on a variety of datasets and tasks demonstrate the effectiveness of the proposed SAS method, achieving significant improvements over different attention variants.
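The up-projection idea can be sketched as follows. Here `proj_q` and `proj_k` are illustrative extra parameters (only d_low x d_high values each per head), not the paper's exact parameterization:

```python
import numpy as np

def simulated_attention(q, k, v, proj_q, proj_k):
    """Scaled dot-product attention with up-projected query/key dimensions.

    q, k: (T, d_low) low-dimensional head representations.
    proj_q, proj_k: (d_low, d_high) projections that simulate a larger
    head dimension d_high without enlarging the rest of the model.
    """
    qh, kh = q @ proj_q, k @ proj_k                # lift to d_high
    scores = qh @ kh.T / np.sqrt(qh.shape[-1])     # scaled dot product
    scores -= scores.max(axis=-1, keepdims=True)   # softmax stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return attn @ v                                # values keep their own dim

rng = np.random.default_rng(0)
T, d_low, d_high, d_v = 5, 4, 16, 4
q = rng.normal(size=(T, d_low))
k = rng.normal(size=(T, d_low))
v = rng.normal(size=(T, d_v))
Pq = rng.normal(size=(d_low, d_high))
Pk = rng.normal(size=(d_low, d_high))
out = simulated_attention(q, k, v, Pq, Pk)
print(out.shape)
```

The attention logits are computed in the simulated d_high space, while all stored activations and value projections stay at their original width, which is how the parameter overhead stays small.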
- North America > United States (0.14)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
Short-Range Oversquashing
Mishayev, Yaaqov, Sverdlov, Yonatan, Amir, Tal, Dym, Nadav
Message Passing Neural Networks (MPNNs) are widely used for learning on graphs, but their ability to process long-range information is limited by the phenomenon of oversquashing. This limitation has led some researchers to advocate Graph Transformers as a better alternative, whereas others suggest that it can be mitigated within the MPNN framework, using virtual nodes or other rewiring techniques. In this work, we demonstrate that oversquashing is not limited to long-range tasks, but can also arise in short-range problems. This observation allows us to disentangle two distinct mechanisms underlying oversquashing: (1) the bottleneck phenomenon, which can arise even in low-range settings, and (2) the vanishing gradient phenomenon, which is closely associated with long-range tasks. We further show that the short-range bottleneck effect is not captured by existing explanations for oversquashing, and that adding virtual nodes does not resolve it. In contrast, transformers do succeed in such tasks, positioning them as the more compelling solution to oversquashing, compared to specialized MPNNs.
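The bottleneck mechanism described above can be illustrated numerically: when a fixed-width vector aggregates many incoming messages, the share attributable to any single neighbor shrinks with fan-in. A toy demonstration (mean aggregation over random messages; not tied to the paper's specific tasks):

```python
import numpy as np

rng = np.random.default_rng(0)

# Bottleneck illustration: a fixed-width aggregated vector must summarize
# all incoming messages, so per-neighbor information degrades with fan-in.
d = 16
errs = []
for n in (2, 16, 128):
    msgs = rng.normal(size=(n, d))
    pooled = msgs.mean(axis=0)             # fixed-size mean aggregation
    # How much of neighbor 0's message survives in the pooled summary:
    err = np.linalg.norm(msgs[0] - pooled) / np.linalg.norm(msgs[0])
    errs.append(err)
    print(f"fan-in={n:4d}  relative loss of neighbor-0 signal={err:.2f}")
```

As fan-in grows, the pooled vector approaches the population mean and retains almost nothing of any individual message, which is the bottleneck effect even when no long paths (and hence no vanishing gradients) are involved.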